About the Data

Data on San Francisco crimes is made public as part of the San Francisco Open Data project. The data is available for download at the following link: https://data.sfgov.org/Public-Safety/SFPD-Incidents-from-1-January-2003/tmnf-yvry. The data is updated periodically. My data set contains entries up to 14 December 2015.

I downloaded the data as a csv file. The data without any processing has the following structure:

## 'data.frame':    1853494 obs. of  13 variables:
##  $ IncidntNum: int  151080775 151083020 151083020 151080747 151083020 151080719 151081701 151080753 151085587 151080731 ...
##  $ Category  : Factor w/ 39 levels "ARSON","ASSAULT",..: 26 16 26 22 2 22 33 37 5 26 ...
##  $ Descript  : Factor w/ 913 levels "ABANDONMENT OF CHILD",..: 674 477 688 894 362 839 792 768 192 674 ...
##  $ DayOfWeek : Factor w/ 7 levels "Friday","Monday",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Date      : Factor w/ 4730 levels "01/01/2003","01/01/2004",..: 4526 4526 4526 4526 4526 4526 4526 4526 4526 4526 ...
##  $ Time      : Factor w/ 1439 levels "00:01","00:02",..: 1420 1420 1420 1420 1420 1413 1410 1410 1410 1400 ...
##  $ PdDistrict: Factor w/ 10 levels "BAYVIEW","CENTRAL",..: 4 4 4 1 4 3 6 5 8 4 ...
##  $ Resolution: Factor w/ 17 levels "ARREST, BOOKED",..: 12 12 12 12 12 1 12 12 12 12 ...
##  $ Address   : Factor w/ 24863 levels "0 Block of 10TH AV",..: 4255 10688 10688 717 10688 20344 8779 14644 2731 13049 ...
##  $ X         : num  -122 -122 -122 -122 -122 ...
##  $ Y         : num  37.8 37.8 37.8 37.7 37.8 ...
##  $ Location  : Factor w/ 36998 levels "(37.7078790224135, -122.463626254961)",..: 23020 19954 19954 8438 19954 5197 19094 30913 30588 18020 ...
##  $ PdId      : num  15108077503074 15108302026148 15108302003051 15108074715161 15108302004170 ...

Since Location holds the same information as columns X and Y, I will remove it. PdId is also not useful here, as it’s just an ID. I will keep IncidntNum because the data set sometimes contains multiple entries for the same event. This variable should be used when counting incidents, as it gives more accurate results than counting rows. I will also combine the Date and Time columns into a single date-time column.
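The code for this step is not shown in the report, so the following is only a minimal sketch of how the cleanup could be done in base R. The toy rows, the Category values, and the UTC time zone are assumptions made for illustration; the column names and the date format follow the structure printed above.

```r
# Toy rows mimicking the raw CSV layout (values are illustrative only).
raw <- data.frame(
  IncidntNum = c(151080775L, 151083020L, 151083020L),
  Category   = c("NON-CRIMINAL", "LARCENY/THEFT", "NON-CRIMINAL"),
  Date       = c("12/14/2015", "12/14/2015", "12/14/2015"),
  Time       = c("23:40", "23:40", "23:40"),
  X          = c(-122.40, -122.41, -122.41),
  Y          = c(37.78, 37.77, 37.77),
  Location   = c("(37.78, -122.40)", "(37.77, -122.41)", "(37.77, -122.41)"),
  PdId       = c(15108077503074, 15108302026148, 15108302003051),
  stringsAsFactors = FALSE
)

# Merge Date and Time into one POSIXct column (the time zone is an assumption).
raw$DateTime <- as.POSIXct(paste(raw$Date, raw$Time),
                           format = "%m/%d/%Y %H:%M", tz = "UTC")

# Drop the columns that are redundant after the conversion.
crimes <- raw[, !(names(raw) %in% c("Location", "PdId", "Date", "Time"))]

# Rows versus distinct incidents: the second number is the safer count.
n_rows      <- nrow(crimes)
n_incidents <- length(unique(crimes$IncidntNum))
```

In this toy example two rows share an incident number, so counting rows would overstate the number of incidents by one.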

Now the structure of the data is:

## 'data.frame':    1853494 obs. of  10 variables:
##  $ IncidntNum: int  151080775 151083020 151083020 151080747 151083020 151080719 151081701 151080753 151085587 151080731 ...
##  $ Category  : Factor w/ 39 levels "ARSON","ASSAULT",..: 26 16 26 22 2 22 33 37 5 26 ...
##  $ Descript  : Factor w/ 913 levels "ABANDONMENT OF CHILD",..: 674 477 688 894 362 839 792 768 192 674 ...
##  $ DayOfWeek : Factor w/ 7 levels "Friday","Monday",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ PdDistrict: Factor w/ 10 levels "BAYVIEW","CENTRAL",..: 4 4 4 1 4 3 6 5 8 4 ...
##  $ Resolution: Factor w/ 17 levels "ARREST, BOOKED",..: 12 12 12 12 12 1 12 12 12 12 ...
##  $ Address   : Factor w/ 24863 levels "0 Block of 10TH AV",..: 4255 10688 10688 717 10688 20344 8779 14644 2731 13049 ...
##  $ X         : num  -122 -122 -122 -122 -122 ...
##  $ Y         : num  37.8 37.8 37.8 37.7 37.8 ...
##  $ DateTime  : POSIXct, format: "2015-12-14 23:40:00" "2015-12-14 23:40:00" ...

Summary of the data

I’ll print a summary of all the columns.

##    IncidntNum                  Category     
##  Min.   :     3979   LARCENY/THEFT :376511  
##  1st Qu.: 60281536   OTHER OFFENSES:264504  
##  Median : 90762729   NON-CRIMINAL  :196765  
##  Mean   : 91532925   ASSAULT       :162231  
##  3rd Qu.:126094840   VEHICLE THEFT :112592  
##  Max.   :991582377   DRUG/NARCOTIC :110431  
##                      (Other)       :630460  
##                                   Descript           DayOfWeek     
##  GRAND THEFT FROM LOCKED AUTO         : 131074   Friday   :282645  
##  LOST PROPERTY                        :  66735   Monday   :256490  
##  BATTERY                              :  57432   Saturday :267530  
##  STOLEN AUTOMOBILE                    :  57081   Sunday   :245820  
##  DRIVERS LICENSE, SUSPENDED OR REVOKED:  56531   Thursday :264416  
##  WARRANT ARREST                       :  49638   Tuesday  :264358  
##  (Other)                              :1435003   Wednesday:272235  
##       PdDistrict                 Resolution     
##  SOUTHERN  :332471   NONE             :1123428  
##  MISSION   :251169   ARREST, BOOKED   : 436281  
##  NORTHERN  :224074   ARREST, CITED    : 154543  
##  BAYVIEW   :187582   LOCATED          :  34433  
##  CENTRAL   :182490   PSYCHOPATHIC CASE:  29151  
##  TENDERLOIN:169560   UNFOUNDED        :  20796  
##  (Other)   :506148   (Other)          :  54862  
##                      Address              X                Y        
##  800 Block of BRYANT ST  :  55792   Min.   :-122.5   Min.   :37.71  
##  800 Block of MARKET ST  :  14268   1st Qu.:-122.4   1st Qu.:37.75  
##  2000 Block of MISSION ST:  10297   Median :-122.4   Median :37.78  
##  1000 Block of POTRERO AV:   8722   Mean   :-122.4   Mean   :37.77  
##  900 Block of MARKET ST  :   6693   3rd Qu.:-122.4   3rd Qu.:37.78  
##  0 Block of TURK ST      :   6529   Max.   :-120.5   Max.   :90.00  
##  (Other)                 :1751193                                   
##     DateTime                  
##  Min.   :2003-01-01 00:01:00  
##  1st Qu.:2006-03-04 12:00:00  
##  Median :2009-07-04 13:50:00  
##  Mean   :2009-07-06 20:40:40  
##  3rd Qu.:2012-11-27 13:03:45  
##  Max.   :2015-12-14 23:40:00  
## 

This summary offers some insight, but not an overall picture of the data set. There are many categories for each feature, so we can only see the categories with the most elements. For instance, we can instantly see that the most common crime is theft, and from the description we can see that the most common entry is “GRAND THEFT FROM LOCKED AUTO”. Southern seems to be the district with the highest criminal activity, and there is an unusually high number of crimes at the 800 Block of BRYANT ST. I will explore those insights in more detail through visualizations.

Univariate Analysis

I’ll take a look at each individual variable.

Main features of interest

The most interesting variables are Category and Resolution. All other variables describe the context of the crime, but the most important are the crime itself and its resolution. In a sense those are the dependent variables. I would like to describe how those variables are distributed in time and space, in order to see trends in criminal behavior and police behavior.

Have certain crimes become more common than others? Did the SFPD solve more or fewer crimes in different periods of time?

Category

The number of crimes in each category:

The most common crime is theft, followed by Other Offenses and Non-Criminal. Those are vague general categories. The next most common specific crimes are Assault, Vehicle Theft and Drug/Narcotic. The categories with the fewest occurrences are hard to see, so I’ll rescale the count to a logarithmic axis.
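As a rough illustration of the counting and the log rescaling, here is a small base R sketch on made-up rows. It counts distinct incident numbers per category rather than rows; the toy data and variable names are assumptions, not the code used for the plots.

```r
# Toy data: incident 2 appears twice, as can happen in the real data set.
crimes <- data.frame(
  IncidntNum = c(1L, 2L, 2L, 3L, 4L, 5L),
  Category   = c("LARCENY/THEFT", "LARCENY/THEFT", "LARCENY/THEFT",
                 "ASSAULT", "LARCENY/THEFT", "ARSON"),
  stringsAsFactors = FALSE
)

# Count distinct incident numbers per category, not rows.
counts <- sapply(split(crimes$IncidntNum, crimes$Category),
                 function(ids) length(unique(ids)))
counts <- sort(counts, decreasing = TRUE)

# On a log10 scale the rare categories are no longer squashed
# against the axis by the very common ones.
log_counts <- log10(counts)

# A bar chart with a logarithmic y axis could then be drawn with:
# barplot(counts, log = "y", las = 2)
```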

Resolution

I’ll explore the distribution of resolutions:

I find it interesting that most crimes have no resolution, followed by a high number of arrests. I’ll rescale this plot to a logarithmic scale, so we can better see the variability among the categories with low number of cases:

Day of week

Crimes by day of week:

There doesn’t seem to be noticeable variability across the days of the week.

District

Crimes by district:

The differences among districts are noticeable: criminal activity varies widely from one district to another.

Those plots provide an overview of all the data, but I would like to see more specific trends, like what crimes are most likely to be resolved, or how the criminal activity varies across the time, and across different areas of the city.

Crimes over time

How do crimes vary across time? Is there a general increasing or decreasing trend? How do they vary by month? Or by hour? Are there fluctuating patterns? We saw that by day of week there doesn’t seem to be a pattern.

To answer the question regarding the trend, I will plot the number of crimes for each month.

There seems to be a yearly pattern. We can better observe this if we plot all yearly periods on top of each other.

The red line represents the mean of monthly crimes from all years. The thin lines represent the number of crimes for each year. There might be some trend, but it doesn’t look conclusive.
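The monthly aggregation behind such a plot can be sketched as follows. This is a base R illustration on toy timestamps, with hypothetical helper columns `year`, `month` and `n`; it is not the report’s own code.

```r
# Toy timestamps spread over two years (illustrative values only).
crimes <- data.frame(
  IncidntNum = 1:7,
  DateTime = as.POSIXct(c("2003-01-05 10:00", "2003-01-20 22:30",
                          "2003-02-11 08:15", "2004-01-02 13:00",
                          "2004-01-15 09:00", "2004-02-14 19:45",
                          "2004-02-28 23:10"),
                        format = "%Y-%m-%d %H:%M", tz = "UTC")
)
crimes$year  <- as.integer(format(crimes$DateTime, "%Y"))
crimes$month <- as.integer(format(crimes$DateTime, "%m"))

# One row per (year, month) with the number of distinct incidents:
# these are the thin per-year lines.
monthly <- aggregate(IncidntNum ~ year + month, data = crimes,
                     FUN = function(ids) length(unique(ids)))
names(monthly)[3] <- "n"

# Mean across years for each calendar month: this is the red line.
monthly_mean <- sapply(split(monthly$n, monthly$month), mean)
```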

How about hourly trends? I suspect there should be a lot of variability among hours. I’ll plot all the points in order to visualize the variability as well as the mean. I’ll limit the y axis to (0, 20) and superimpose the average as a red line.

An interesting pattern: few crimes happen during the night, yet the highest number is recorded around midnight.
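One way to obtain both the individual points and the hourly mean is to count incidents per day and hour first, then average by hour. The sketch below uses toy data and hypothetical column names; note that hours with no incidents on a given day do not appear in the counts, so this simple mean ignores the zeros.

```r
# Toy timestamps over two days (illustrative values only).
crimes <- data.frame(
  IncidntNum = 1:6,
  DateTime = as.POSIXct(c("2015-12-13 00:05", "2015-12-13 00:40",
                          "2015-12-13 04:10", "2015-12-14 00:15",
                          "2015-12-14 18:30", "2015-12-14 18:55"),
                        format = "%Y-%m-%d %H:%M", tz = "UTC")
)
crimes$day  <- format(crimes$DateTime, "%Y-%m-%d")
crimes$hour <- as.integer(format(crimes$DateTime, "%H"))

# One point per (day, hour): the cloud of points in the plot.
hourly <- aggregate(IncidntNum ~ day + hour, data = crimes,
                    FUN = function(ids) length(unique(ids)))
names(hourly)[3] <- "n"

# Mean count per hour of day: the superimposed red line.
# Day/hour pairs with zero incidents are absent and therefore not averaged in.
hourly_mean <- sapply(split(hourly$n, hourly$hour), mean)
```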

Addresses

I’ll compute the average daily number of crimes per address (reported in the column labeled percentage) and print the top 10 addresses:

## Source: local data frame [10 x 5]
## 
##                     Address no_crimes         X        Y percentage
##                      (fctr)     (int)     (dbl)    (dbl)      (dbl)
## 1    800 Block of BRYANT ST     47666 -122.4034 37.77542 10.0752484
## 2    800 Block of MARKET ST     10802 -122.4071 37.78465  2.2832382
## 3  1000 Block of POTRERO AV      7237 -122.4065 37.75644  1.5296977
## 4  2000 Block of MISSION ST      5975 -122.4196 37.76422  1.2629465
## 5    900 Block of MARKET ST      4730 -122.4090 37.78327  0.9997886
## 6         0 Block of 6TH ST      4037 -122.4093 37.78144  0.8533080
## 7      16TH ST / MISSION ST      4016 -122.4197 37.76505  0.8488692
## 8        0 Block of TURK ST      3986 -122.4099 37.78336  0.8425280
## 9     300 Block of ELLIS ST      3484 -122.4120 37.78496  0.7364194
## 10 100 Block of OFARRELL ST      3399 -122.4072 37.78655  0.7184528

It seems that the 800 Block of Bryant Street is at the top of the list by far, with around 10 crimes per day on average! A quick look at the map reveals that this is the address of the San Francisco Police Officers Association (http://www.sfpoa.org/).
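The last column of the table can be reproduced from the incident counts and the number of days covered by the data. The short check below assumes the period runs from 1 January 2003 to 14 December 2015 inclusive, which matches the DateTime range in the summary; the variable names are illustrative.

```r
# Number of calendar days covered by the data set, both ends included.
first_day <- as.Date("2003-01-01")
last_day  <- as.Date("2015-12-14")
n_days    <- as.integer(last_day - first_day) + 1L

# Incident counts for the top two addresses, copied from the table above.
no_crimes <- c("800 Block of BRYANT ST" = 47666,
               "800 Block of MARKET ST" = 10802)

# Average number of incidents per day at each address.
daily_avg <- no_crimes / n_days
```

Under that assumption the values agree with the table to the printed precision, which suggests the column holds a daily average rather than a percentage.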

Now I’d like to view the top 100 addresses on the map:

Now I’m surprised that most of them are in Tenderloin, not in Southern, as the distribution of crimes by district suggested. But Tenderloin is more central, so when we account for population density, this makes sense.

Districts on map

I’d like to see the districts, as they are recorded in this data set. I’ll plot all crimes on the map, and color them according to the districts.

We can see here that Tenderloin is the smallest district, but it is in a high density area. This is the district with the highest density of crimes. Southern district has the most crimes recorded.

Description

I’d like to see what’s in the Descript variable.

As there are over 900 different descriptions, I will explore this variable in relation to Category. It can be used to group the data as a tree structure, with Category as the main classifier and Descript for subcategories.

Multivariate analysis

Resolution vs. no resolution by category

I’d like to see which categories are most likely to be resolved.

This plot shows the percentage of solved crimes by category. But it doesn’t take into account the number of crimes in each category. I will combine this plot with the initial plot of categories distribution.
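A minimal sketch of this computation, assuming that a crime counts as solved whenever its Resolution is anything other than "NONE". The toy rows and the names `resolved`, `pct_resolved` and `n_crimes` are illustrative assumptions.

```r
# Toy rows using category and resolution labels that appear in the data set.
crimes <- data.frame(
  Category   = c("ASSAULT", "ASSAULT",
                 "LARCENY/THEFT", "LARCENY/THEFT",
                 "LARCENY/THEFT", "LARCENY/THEFT",
                 "DRUG/NARCOTIC", "DRUG/NARCOTIC"),
  Resolution = c("NONE", "ARREST, BOOKED",
                 "NONE", "NONE", "NONE", "ARREST, CITED",
                 "ARREST, BOOKED", "ARREST, BOOKED"),
  stringsAsFactors = FALSE
)

# Assumption: any resolution other than "NONE" means the crime was solved.
crimes$resolved <- crimes$Resolution != "NONE"

# Share of solved crimes and total number of crimes per category.
pct_resolved <- 100 * sapply(split(crimes$resolved, crimes$Category), mean)
n_crimes     <- sapply(split(crimes$resolved, crimes$Category), length)

# Keeping both side by side shows how many cases each percentage rests on.
by_category <- data.frame(n_crimes, pct_resolved)
```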

This shows the relation between Category and crimes that had any resolution at all. But I would like to view the relation between each crime and each resolution.

This looks interesting. Most categories follow the pattern of resolution distribution we saw previously, with no resolution most of the time, followed by arrests. The exceptions are Non-Criminal, which is strongly associated with Psychopathic Case, and Missing Person and Runaway, whose most common resolutions are Located or None. From the resolutions’ point of view, Psychopathic Case and Located have an unusual distribution across categories: Located is strongly associated with Missing Person and Runaway, and Psychopathic Case with Non-Criminal.
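The relation between each category and each resolution can be summarised with a contingency table. The following base R sketch uses toy rows and normalises each row, so that every category’s resolutions sum to 1; it illustrates the idea and is not the code behind the plot.

```r
# Toy rows using labels that appear in the data set.
crimes <- data.frame(
  Category   = c("MISSING PERSON", "MISSING PERSON", "MISSING PERSON",
                 "ASSAULT", "ASSAULT", "ASSAULT"),
  Resolution = c("LOCATED", "LOCATED", "NONE",
                 "NONE", "NONE", "ARREST, BOOKED"),
  stringsAsFactors = FALSE
)

# Counts for every (Category, Resolution) pair.
tab <- table(crimes$Category, crimes$Resolution)

# Row-wise shares: how the resolutions are distributed within each category.
row_share <- prop.table(tab, margin = 1)
```

Using `margin = 2` instead would give the column-wise view, that is, how each resolution is distributed across categories.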

Category vs District

I’d like to see the distribution of Categories by Districts.

I decided to make two plots, because I wanted to rescale the color gradient for less common crimes, in order to better visualize differences among less common crimes.

Tenderloin is the most peculiar. It looks like the most common crime there is Drug/Narcotic. Some crimes, like Prostitution or Runaway, seem to be concentrated in a couple of districts. Even Theft, which is the most common crime overall, seems to be more common in some districts than in others.

This is really interesting, but I’d like to see the evolution in time of categories and resolutions across districts.

Categories over time

We can see interesting patterns here. Theft increased steadily starting from 2010 or 2011, and there is an interesting rise from 2008 to 2010 for Other Offenses. Non-Criminal increased from 2010, while Drug/Narcotic decreased.

I wonder if SFPD changed the way they record some crimes, maybe crimes that were recorded as Drug/Narcotic are now recorded as Non-Criminal.

Categories by district over time

I would like to visualize the most common crimes by district.

As we saw earlier, most drug crimes happen in Tenderloin, but now we can see that there was high activity between 2008 and 2010. At present, this no longer seems to be the most common crime there. Theft became more common in recent years, but here we see that it increased a lot only in a few districts, like Central, Northern and Southern, and only slightly in others, like Richmond and Park, while in the remaining districts it stayed about the same.

Resolution Percentage Over Time

Another interesting question is how the percentage of crimes that had a resolution evolved over time. We saw which crimes are more likely to be resolved, and which are more likely to be resolved one way or another, but how did all this change over time? Did the resolution percentage evolve differently in different districts?

In 2015, the percentage of crimes that had a resolution decreased significantly. When I split by district, it’s clear that Tenderloin had most crimes with resolution.
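Splitting the resolution percentage by year and district only needs a second grouping variable. The sketch below again treats any Resolution other than "NONE" as solved and uses toy rows; in the real data the `year` column would be derived from DateTime, for example with `format(DateTime, "%Y")`.

```r
# Toy rows (illustrative values only).
crimes <- data.frame(
  year       = c(2014, 2014, 2014, 2014, 2014, 2014, 2014, 2015, 2015),
  PdDistrict = c("TENDERLOIN", "TENDERLOIN", "TENDERLOIN",
                 "RICHMOND", "RICHMOND", "RICHMOND", "RICHMOND",
                 "TENDERLOIN", "TENDERLOIN"),
  Resolution = c("ARREST, BOOKED", "ARREST, CITED", "NONE",
                 "NONE", "NONE", "ARREST, BOOKED", "NONE",
                 "ARREST, BOOKED", "NONE"),
  stringsAsFactors = FALSE
)

# Assumption: any resolution other than "NONE" means the crime was solved.
crimes$resolved <- as.numeric(crimes$Resolution != "NONE")

# Share of solved crimes for every (year, district) pair.
by_group <- aggregate(resolved ~ year + PdDistrict, data = crimes, FUN = mean)
by_group$pct_resolved <- 100 * by_group$resolved
```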

Density of crimes

I saw how the most common crimes evolved over time. Now I’d like to see where those crimes happen more often, and whether the center of some crimes moved over time.

It seems that vehicles are stolen from anywhere in the city. For drugs and other crimes, on the other hand, it looks like there are a few large centers. I’ll explore the drug crimes now.

Indeed, there is a large center and a few smaller centers. For comparison, I’ll also plot all drug-related events as points in very faint colors:

So there is activity everywhere in the city, but on the heat map we can see only the areas with a really high density of drug-related crimes.

Now I’d like to see if the center moved during the years. I’ll group the data by years, and I’ll plot the heat map for every year.

The story this plot is telling is that there are a couple of large drug markets, that remained at about the same location. The largest drug market is in Tenderloin, followed by Mission, Park and Bayview. The smaller ones seem to have expanded, then decreased, then expanded again. Although the number of drug incidents decreased in the last years, the events seem to be more dispersed in 2012-2015 (similar to 2005-2007) than in 2008-2011.
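The report does not show how the heat maps were produced. One common way to estimate such a density surface is a two-dimensional kernel density estimate, sketched here with `kde2d` from the MASS package on simulated coordinates. The points, the cluster location and the grid size are all assumptions made for illustration.

```r
library(MASS)  # provides kde2d()

set.seed(1)
# Simulated coordinates: one dense cluster plus scattered background points.
x <- c(rnorm(200, mean = -122.413, sd = 0.002), runif(50, -122.50, -122.38))
y <- c(rnorm(200, mean = 37.783,   sd = 0.002), runif(50, 37.71, 37.81))

# Kernel density estimate evaluated on a 50 x 50 grid.
dens <- kde2d(x, y, n = 50)

# Grid cell with the highest estimated density, i.e. the main "center".
peak    <- which(dens$z == max(dens$z), arr.ind = TRUE)[1, ]
peak_xy <- c(lon = dens$x[peak[["row"]]], lat = dens$y[peak[["col"]]])
```

Repeating the estimate on each year’s subset of the data and comparing the peak locations would be one way to check whether a center moved.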

Final Plots

First Plot

There seems to be a pattern here. Crimes that can be reported after the event, like theft, assault, vandalism, or burglary, are less likely to be solved. Crimes that have a history, like cases where the police have a warrant, or crimes where the police intervene directly at the moment the crime is committed, like drug crimes or drunkenness cases, are the most likely to be resolved. We can also see that the most common crime is theft, followed by other small crimes, labeled as Other Offenses or Non-Criminal. Assault is the 4th most common crime and the most common violent crime. Slightly less than 50% of assault cases are solved.

Second Plot

This plot shows how the share of crimes that had a resolution differs among districts over time. I chose this plot because we can see a striking difference between the percentages of crimes solved in different districts. The district with the most crimes solved is Tenderloin, where the percentage of crimes solved stayed between 60% and 80% for a long period of time. This percentage dropped to about 50% in 2014-2015. We then have Mission and Bayview, with 40-50% of crimes solved. At the other end of the spectrum we have the Richmond and Central districts, with 20-30% of crimes solved. We can also see an overall decrease in the percentage of crimes solved in the last 2 years.

Third plot

For the final plot I chose the heat map of drug crimes, but I zoomed in on Tenderloin, as it is the area with the most drug crimes, and I wanted to look at criminal activity at street level. From this plot we can see the streets where most drug crimes are reported. A lot of drug activity happens on Ellis Street, mostly at the corner of Jones Street. Turk Street is also very hot in terms of drug-related events, from Hyde Street (also a very busy street) all the way down to Market Street. On the other side of Market Street, 6th Street is also very active in terms of drug activity.

There are on average 7.62 drug crimes daily in Tenderloin alone. Of those, 1.83 are on Turk Street and 1.26 on Ellis Street. It looks like the police visit those streets quite often.

Interesting fact, at the bottom-right corner we can see the address of SFPOA. A lot of crimes are reported at that address, but I’m not entirely sure if those particular reports are accurate.

Reflections

One of the difficulties I encountered was plotting information that required some preprocessing of the data, for example when I wanted to plot the percentage of crimes that had or did not have a resolution. I used the dplyr library for this, but when I wanted to extend the plot to include another variable, I had to do some more preprocessing. This is OK, but my confusion comes from a lack of experience with a grammar of graphics approach to plotting. I feel that ggplot’s style of sending the whole data to the plotting library and selecting what to plot and how is useful most of the time, because it allows you to quickly create relatively complex plots. At other times it’s confusing, because I’m not sure whether a plot can be created with ggplot directly or whether I need to process the data first. I assume this gets easier with more practice. The approach I had until now was to prepare the data first (for example when using matplotlib in Python), so it was more a challenge of finding the best way to create the specific plot in R.

The most puzzling surprise I encountered was finding that the address where most crimes were reported was the address of the San Francisco Police Officers Association. Not only is it at the top of the list, but there are over 4 times more crimes reported at that address than at the next most common address. I checked those addresses on Google Street View, but they looked like regular city streets. I suspect that some crimes that had no address reported were automatically assigned to this address, but I can’t test this hypothesis. Otherwise it would be quite odd that on average 10 crimes happen daily just across the street from the police office. I assume there are always police officers around there, so it’s easy for them to intervene if something is happening there.

Summary and Future Work

Overall the experience of going through this analysis was very interesting. I think this analysis sheds some light on what crimes are most common, where they are most common, and what types of crimes are most likely to be resolved.

I think further research can be done by investigating more specific trends, using an approach similar to the one I used for drug crimes. This analysis shows only a general overview of the data, but there is definitely a lot more hidden information in this data set. For instance, we may be interested in all crimes involving teenagers (those cases are labeled with juvenile either in resolutions or in descriptions). We can then explore different categories and trends within this subset of the data. We may be interested only in vehicle thefts, or in violent crimes. There is a lot of information in this data set that can be explored.